Before embarking on developing statistical models and generating predictions, it is essential to understand your data. This is typically done using conventional numerical and graphical methods. John Tukey (Tukey, 1977) advocated the practice of exploratory data analysis (EDA) as a critical part of the scientific process.
“No catalog of techniques can convey a willingness to look for what can be seen, whether or not anticipated. Yet this is at the heart of exploratory data analysis. The graph paper and transparencies are there, not as a technique, but rather as a recognition that the picture examining eye is the best finder we have of the wholly unanticipated.”
Fortunately, we can dispense with the graph paper and transparencies and use software that makes routine work of developing the ‘pictures’ (i.e., graphical output) and descriptive statistics needed to explore our data.
Descriptive statistics include:
Graphical methods include:
Before you start an EDA, you should inspect your data and correct all typos and blatant errors. EDA can then be used to identify additional errors, such as outliers, and to help you determine appropriate statistical analyses. For this chapter we’ll use the loafercreek dataset from the CA630 Soil Survey Area.
library(soilDB)
# Load the loafercreek dataset
data("loafercreek")
# Generalize the horizon designations
n <- c("A",
"BAt",
"Bt1",
"Bt2",
"Cr",
"R")
# REGEX rules
p <- c("A",
"BA|AB",
"Bt|Bw",
"Bt3|Bt4|2B|C",
"Cr",
"R")
# Compute genhz labels and add to loafercreek dataset
loafercreek$genhz <- generalize.hz(loafercreek$hzname, n, p)
# Extract the horizon table
h <- horizons(loafercreek)
# Extract the site table
s <- site(loafercreek)
# Examine the pairing of the genhz labels to the hznames
with(h, table(genhz, hzname))
## hzname
## genhz 2BCt 2Bt1 2Bt2 2Bt3 2Bt4 2Bt5 2CB 2CBt 2Cr 2Crt 2R A A1 A2
## A 0 0 0 0 0 0 0 0 0 0 0 106 7 7
## BAt 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## Bt1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## Bt2 1 2 7 9 7 1 1 1 0 0 0 0 0 0
## Cr 0 0 0 0 0 0 0 0 5 2 0 0 0 0
## R 0 0 0 0 0 0 0 0 0 0 5 0 0 0
## not-used 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## hzname
## genhz AB ABt Ad Ap B BA BAt BC BCt Bt Bt1 Bt2 Bt3 Bt4 Bw Bw1
## A 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0
## BAt 1 0 0 0 0 33 8 0 0 0 0 0 0 0 0 0
## Bt1 0 4 0 0 0 0 0 0 0 9 102 97 0 0 10 2
## Bt2 0 0 0 0 0 0 0 5 17 0 0 0 50 8 0 0
## Cr 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## R 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## not-used 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
## hzname
## genhz Bw2 Bw3 C CBt Cd Cr Cr/R Crt H1 Oi R Rt
## A 0 0 0 0 0 0 0 0 0 0 0 0
## BAt 0 0 0 0 0 0 0 0 0 0 0 0
## Bt1 2 1 0 0 0 0 0 0 0 0 0 0
## Bt2 0 0 6 6 1 0 0 0 0 0 0 0
## Cr 0 0 0 0 0 53 0 24 0 0 0 0
## R 0 0 0 0 0 0 1 0 0 0 44 2
## not-used 0 0 0 0 0 0 0 0 1 27 0 0
As noted in Chapter 1, a visual examination of the raw data is possible by looking at the R object:
View(h)
# or by clicking on the dataset in the environment tab
This view is fine for a small dataset but can be cumbersome for larger ones. The summary() function can be used to quickly summarize a dataset; however, even for our small example dataset, the output can be voluminous. Therefore, in the interest of saving space, we’ll only look at a sample of columns.
vars <- c("genhz", "clay", "total_frags_pct", "phfield", "effclass")
summary(h[vars])
## genhz clay total_frags_pct phfield
## A :122 Min. :10.00 Min. : 0.00 Min. :4.900
## BAt : 42 1st Qu.:18.00 1st Qu.: 0.00 1st Qu.:6.000
## Bt1 :227 Median :22.00 Median : 5.00 Median :6.200
## Bt2 :122 Mean :23.52 Mean :14.18 Mean :6.163
## Cr : 84 3rd Qu.:28.00 3rd Qu.:20.00 3rd Qu.:6.500
## R : 52 Max. :60.00 Max. :95.00 Max. :7.000
## not-used: 29 NA's :189 NA's :422
## effclass
## very slight: 0
## slight : 0
## strong : 0
## violent : 0
## none : 94
## NA's :584
##
# or
# sub <- subset(h, select = vars)
# summary(sub)
The summary() function is known as a generic R function. It will return a preprogrammed summary for any R object. Because h is a data frame, we get a summary of each column. Factors are summarized by their frequency (i.e., number of observations), numeric or integer variables print a five-number summary plus the mean, and character variables simply report their length, class, and mode. The number of missing observations for any variable is also printed if any are present. If any of these metrics look unfamiliar to you, don’t worry; we’ll cover them shortly.
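As a quick illustration of this generic behavior, here is a sketch using toy vectors (made up for illustration, not from loafercreek) showing how summary() adapts to the class of its argument:

```r
# summary() adapts its output to the class of its argument
summary(c(10, 18, 22, 60))             # numeric: five-number summary plus mean
summary(factor(c("A", "A", "Bt")))     # factor: counts per level
summary(c("A", "A", "Bt"))             # character: length, class, and mode
```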
When you do have missing data and the function you want to run will not run with missing values, the following options are available:
- Exclude the missing values with na.exclude(), such as h2 <- na.exclude(h). However, this can be wasteful because it removes entire rows (e.g., horizons), even if a row has only one missing value. Instead, it is sometimes best to create a temporary copy of the variable in question and remove its missing values, such as clay <- na.exclude(h$clay).
- Replace the missing values, such as h$clay <- ifelse(is.na(h$clay), 0, h$clay) # or h[is.na(h$clay), ] <- 0. Note that replacing with 0 is only appropriate when 0 is a valid value.
- Use the na.rm argument, which many functions provide, such as mean(h$clay, na.rm = TRUE).

A quick check for typos would be to examine the list of levels for a factor or character, such as:
# just for factors
levels(h$genhz)
## [1] "A" "BAt" "Bt1" "Bt2" "Cr" "R"
## [7] "not-used"
# for characters and factors
# unique(h$hzname)
# sort unique hznames
sort(unique(h$hzname))
## [1] "2BCt" "2Bt1" "2Bt2" "2Bt3" "2Bt4" "2Bt5" "2CB" "2CBt" "2Cr" "2Crt"
## [11] "2R" "A" "A1" "A2" "AB" "ABt" "Ad" "Ap" "B" "BA"
## [21] "BAt" "BC" "BCt" "Bt" "Bt1" "Bt2" "Bt3" "Bt4" "Bw" "Bw1"
## [31] "Bw2" "Bw3" "C" "CBt" "Cd" "Cr" "Cr/R" "Crt" "H1" "Oi"
## [41] "R" "Rt"
If the unique() function returned typos such as “BT” or “B t”, you could either fix your original dataset or you could make an adjustment in R, such as:
h$hzname <- ifelse(h$hzname == "BT", "Bt", h$hzname)
# or
# h$hzname[h$hzname == "BT"] <- "Bt"
# or as a last resort we could manually edit the spreadsheet in R
# edit(h)
Typo errors such as these are a common problem with old pedon data in NASIS.
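When such typos follow predictable patterns, base R string functions can clean them in bulk. The sketch below uses made-up horizon designations (not loafercreek data) to show stripping stray spaces and fixing capitalization:

```r
# clean a few common typo patterns in horizon designations
hz <- c("BT", "B t", " Bt1", "bt2")
hz <- trimws(hz)             # strip leading/trailing whitespace
hz <- gsub(" ", "", hz)      # remove internal spaces, "B t" -> "Bt"
hz <- sub("^BT", "Bt", hz)   # fix all-caps "BT"
hz <- sub("^bt", "Bt", hz)   # fix lowercase "bt"
hz
## [1] "Bt"  "Bt"  "Bt1" "Bt2"
```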
# gopheridge rules
n <- c('A', 'Bt1', 'Bt2', 'Bt3','Cr','R')
p <- c('^A|BA$', 'Bt1|Bw','Bt$|Bt2', 'Bt3|CBt$|BCt','Cr','R')
| Parameter | NASIS | Description | R function |
|---|---|---|---|
| Mean | RV | arithmetic average | mean() |
| Median | RV | middle value, 50% quantile | median() |
| Mode | RV | most frequent value | sort(table(), decreasing = TRUE)[1] |
| Standard Deviation | L & H | variation around mean | sd() |
| Quantiles | L & H | percent rank of values; the value below which a given percent of the observations fall | quantile() |
These measures are used to determine the mid-point of the range of observed values. In NASIS speak this should ideally be equivalent to the representative value (RV) for numeric and integer data. The mean and median are the most commonly used measures for our purposes.
Mean - is the arithmetic average all are familiar with, formally expressed as:

\(\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}\)

which sums ( \(\sum\) ) all the x values in the sample and divides by the number (n) of samples. It is assumed that all references in this document refer to samples rather than a population.
The mean clay content from the loafercreek dataset may be determined:
clay <- na.exclude(h$clay) # first remove missing values and create a new vector
mean(clay)
## [1] 23.52372
# or use the additional na.rm argument
mean(h$clay, na.rm = TRUE)
## [1] 23.52372
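The formula can also be written out by hand with sum() and length(), which is equivalent to mean() when there are no missing values (a sketch with a toy vector):

```r
# mean "by hand": sum of the values divided by the number of samples
x <- c(10, 20, 30, 40)
sum(x) / length(x)   # 25
mean(x)              # 25
```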
Median is the middle measurement of a sample set, and as such is a more robust estimate of central tendency than the mean. This is known as the middle or 50th quantile, meaning there are an equal number of samples with values less than and greater than the median. For example, assuming there are 21 samples, sorted in ascending order, the median would be the 11th sample.
The median from the sample dataset may be determined:
median(clay)
## [1] 22
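The "middle value" definition can be verified by hand: for an odd number of samples, the median is simply the middle order statistic (a sketch with simulated data):

```r
# with 21 values, the median equals the 11th sorted value
set.seed(1)                # make the random draw reproducible
x <- rnorm(21)
sort(x)[11] == median(x)   # TRUE
```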
Mode - is the most frequent measurement in the sample. The use of the mode is typically reserved for factors, which we will discuss shortly. One issue with using the mode for numeric data is that the data need to be rounded to the desired level of precision. In the following example it looks like someone entered laboratory data into the pedon horizon table, producing clay values with many decimal places. In each block of output, the first row shows the values and the row beneath it shows the number of occurrences of each value.
table(h$clay) # show a frequency table
##
## 10 11 12 13
## 4 4 9 12
## 14 15 15.1000003814697 16
## 12 24 1 28
## 16.7000007629395 17 18 18.5
## 1 15 38 1
## 19 19.2999992370605 19.6000003814697 20
## 15 1 1 38
## 21 21.3999996185303 21.5 21.7000007629395
## 15 1 1 1
## 22 23 23.3999996185303 23.6000003814697
## 25 13 1 1
## 24 25 25.2000007629395 25.3999996185303
## 20 44 1 1
## 26 27 28 29
## 21 9 29 9
## 30 31 32 32.2999992370605
## 17 2 20 1
## 33 34 35 36
## 8 5 10 2
## 37 38 39 40
## 3 2 1 3
## 41 42 43 44
## 1 3 2 1
## 45 48 50 51.4000015258789
## 3 1 6 1
## 60
## 1
# we can fix the rounding error like so
# table(round(h$clay))
# or
# table(as.integer(h$clay))
sort(table(round(h$clay)), decreasing = TRUE)[1] # sort and select the 1st value, which will be the mode
## 25
## 46
Frequencies
To summarize factors and characters we can examine their frequency or number of observations. This is accomplished using the table() function.
table(h$genhz)
##
## A BAt Bt1 Bt2 Cr R not-used
## 122 42 227 122 84 52 29
# or
# summary(h$genhz)
This gives us a count of the number of observations for each horizon. If we want to see the comparison between two different factors or characters, we can include two variables.
table(h$genhz, h$hzname)
##
## 2BCt 2Bt1 2Bt2 2Bt3 2Bt4 2Bt5 2CB 2CBt 2Cr 2Crt 2R A A1 A2
## A 0 0 0 0 0 0 0 0 0 0 0 106 7 7
## BAt 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## Bt1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## Bt2 1 2 7 9 7 1 1 1 0 0 0 0 0 0
## Cr 0 0 0 0 0 0 0 0 5 2 0 0 0 0
## R 0 0 0 0 0 0 0 0 0 0 5 0 0 0
## not-used 0 0 0 0 0 0 0 0 0 0 0 0 0 0
##
## AB ABt Ad Ap B BA BAt BC BCt Bt Bt1 Bt2 Bt3 Bt4 Bw Bw1
## A 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0
## BAt 1 0 0 0 0 33 8 0 0 0 0 0 0 0 0 0
## Bt1 0 4 0 0 0 0 0 0 0 9 102 97 0 0 10 2
## Bt2 0 0 0 0 0 0 0 5 17 0 0 0 50 8 0 0
## Cr 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## R 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## not-used 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
##
## Bw2 Bw3 C CBt Cd Cr Cr/R Crt H1 Oi R Rt
## A 0 0 0 0 0 0 0 0 0 0 0 0
## BAt 0 0 0 0 0 0 0 0 0 0 0 0
## Bt1 2 1 0 0 0 0 0 0 0 0 0 0
## Bt2 0 0 6 6 1 0 0 0 0 0 0 0
## Cr 0 0 0 0 0 53 0 24 0 0 0 0
## R 0 0 0 0 0 0 1 0 0 0 44 2
## not-used 0 0 0 0 0 0 0 0 1 27 0 0
table(h$genhz, h$texture_class)
##
## br c cb cl gr l mpm pg scl sic sicl sil sl spm
## A 0 0 0 0 0 87 0 0 0 0 0 27 6 0
## BAt 0 0 0 2 0 33 0 0 0 0 1 4 1 0
## Bt1 0 2 0 47 0 139 0 0 6 1 7 20 1 0
## Bt2 0 16 1 54 2 32 0 1 5 1 3 5 0 0
## Cr 69 1 0 0 0 0 0 0 0 0 0 0 0 0
## R 46 0 0 0 0 0 0 0 0 0 0 0 0 0
## not-used 0 0 0 1 0 0 1 0 0 0 0 0 0 24
We can also easily add margin totals to the table or convert the table frequencies to proportions.
# append the table with row and column sums
addmargins(table(h$genhz, h$texture_class))
##
## br c cb cl gr l mpm pg scl sic sicl sil sl spm Sum
## A 0 0 0 0 0 87 0 0 0 0 0 27 6 0 120
## BAt 0 0 0 2 0 33 0 0 0 0 1 4 1 0 41
## Bt1 0 2 0 47 0 139 0 0 6 1 7 20 1 0 223
## Bt2 0 16 1 54 2 32 0 1 5 1 3 5 0 0 120
## Cr 69 1 0 0 0 0 0 0 0 0 0 0 0 0 70
## R 46 0 0 0 0 0 0 0 0 0 0 0 0 0 46
## not-used 0 0 0 1 0 0 1 0 0 0 0 0 0 24 26
## Sum 115 19 1 104 2 291 1 1 11 2 11 56 8 24 646
# calculate proportions relative to the rows (margin = 1)
# margin = 2 would calculate relative to the columns
# margin = NULL would calculate relative to the total number of observations
round(prop.table(table(h$genhz, h$texture_class), margin = 1) * 100)
##
## br c cb cl gr l mpm pg scl sic sicl sil sl spm
## A 0 0 0 0 0 72 0 0 0 0 0 22 5 0
## BAt 0 0 0 5 0 80 0 0 0 0 2 10 2 0
## Bt1 0 1 0 21 0 62 0 0 3 0 3 9 0 0
## Bt2 0 13 1 45 2 27 0 1 4 1 2 4 0 0
## Cr 99 1 0 0 0 0 0 0 0 0 0 0 0 0
## R 100 0 0 0 0 0 0 0 0 0 0 0 0 0
## not-used 0 0 0 4 0 0 4 0 0 0 0 0 0 92
To determine the mean by a group or category, use the aggregate command:
aggregate(clay ~ genhz, data = h, mean)
## genhz clay
## 1 A 16.16957
## 2 BAt 19.45789
## 3 Bt1 24.07477
## 4 Bt2 31.11880
## 5 Cr 15.00000
To determine the median by group or category, use the aggregate command again:
aggregate(clay ~ genhz, data = h, median)
## genhz clay
## 1 A 16.0
## 2 BAt 18.5
## 3 Bt1 24.0
## 4 Bt2 30.0
## 5 Cr 15.0
# or we could use the summary() function to get both the mean and median
aggregate(clay ~ genhz, data = h, summary)
## genhz clay.Min. clay.1st Qu. clay.Median clay.Mean clay.3rd Qu.
## 1 A 10.00000 14.00000 16.00000 16.16957 18.00000
## 2 BAt 14.00000 17.25000 18.50000 19.45789 20.00000
## 3 Bt1 12.00000 20.00000 24.00000 24.07477 27.00000
## 4 Bt2 10.00000 26.00000 30.00000 31.11880 35.00000
## 5 Cr 15.00000 15.00000 15.00000 15.00000 15.00000
## clay.Max.
## 1 25.00000
## 2 28.00000
## 3 51.40000
## 4 60.00000
## 5 15.00000
These are measures used to determine the spread of values around the mid-point. This is useful to determine if the samples are spread widely across the range of observations or concentrated near the mid-point. In NASIS speak these values might equate to the low (L) and high (H) values for numeric and integer data.
Variance is a positive value indicating deviation from the mean:

\(s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}\)

This is the sum of the squared deviations from the mean, divided by the number of samples minus 1. The numerator is commonly referred to as the sum of squares. As the deviations increase, the variance increases. Conversely, if there is no deviation, the variance will equal 0. As a squared value, variance is always positive. Variance is an important component of many statistical analyses, including the most commonly used measure of dispersion, the standard deviation. Variance for the sample dataset is:
var(clay)
## [1] 62.44497
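The formula can be checked by hand on a small toy vector:

```r
# variance "by hand": sum of squared deviations over n - 1
x <- c(10, 20, 30, 40)
sum((x - mean(x))^2) / (length(x) - 1)   # 166.6667
var(x)                                   # same result
```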
Standard Deviation is the square root of the variance:

\(s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}\)

The units of the standard deviation are the same as the units measured. From the formula you can see that the standard deviation is simply the square root of the variance. Standard deviation for the sample dataset is:
sd(clay)
## [1] 7.902213
# or
# sqrt(var(clay))
Coefficient of Variation (CV) is a relative (i.e., unitless) measure of standard deviation:

\(CV = \frac{s}{\bar{x}} \times 100\)

CV is calculated by dividing the standard deviation by the mean and multiplying by 100. Because the standard deviation varies in magnitude with the value of the mean, the CV is useful for comparing the relative variation of different datasets. However, Webster (2001) discourages using CV to compare different variables, and also stresses that CV is reserved for variables that have an absolute zero, like clay content. CV may be calculated for the sample dataset as:
cv <- sd(clay) / mean(clay) * 100
cv
## [1] 33.59253
Quantiles (a.k.a. percentiles) - a percentile is the value below which a given percentage of the data values fall when sorted in ascending order.
The default for the quantile() function returns the minimum, 25th percentile, median (50th percentile), 75th percentile, and maximum, known as the five-number summary originally proposed by Tukey. Other probabilities can also be used. At present, the 5th, 50th, and 95th percentiles are being proposed for determining the range in characteristics (RIC) for a given soil property.
quantile(clay)
## 0% 25% 50% 75% 100%
## 10 18 22 28 60
# or
quantile(clay, c(0.05, 0.5, 0.95))
## 5% 50% 95%
## 13.0 22.0 37.6
Thus, for the five number summary 25% of the observations fall between each of the intervals. Quantiles are a useful metric because they are largely unaffected by the distribution of the data, and have a simple interpretation.
Range is the difference between the highest and lowest measurement of a group. Using the sample data it may be determined as:
range(clay)
## [1] 10 60
which returns the minimum and maximum values observed, or:
diff(range(clay))
## [1] 50
# or
# max(clay) - min(clay)
which returns the value of the range.
Interquartile Range (IQR) is the range from the upper (75%) quartile to the lower (25%) quartile. This represents 50% of the observations occurring in the mid-range of a sample. IQR is a robust measure of dispersion, unaffected by the distribution of data. In soil survey lingo you could consider the IQR to estimate the central concept of a soil property. IQR may be calculated for the sample dataset as:
IQR(clay)
## [1] 10
# or
# diff(quantile(clay, c(0.25, 0.75)))
Now that we’ve checked for missing values and typos and made corrections, we can graphically examine the sample data distribution of our integer and numeric data. Frequency distributions are useful because they can help us visualize the center (e.g., RV) and spread or dispersion (e.g., low and high) of our data. Typically in introductory statistics the normal (i.e., Gaussian) distribution is emphasized.
| Plot Types | Description |
|---|---|
| Bar | a plot where each bar represents the frequency of observations for a ‘group’ |
| Histogram | a plot where each bar represents the frequency of observations for a ‘given range of values’ |
| Density | an estimation of the frequency distribution based on the sample data |
| Quantile-Quantile | a plot of the actual data values against a normal distribution |
| Box-Whisker | a visual representation of median, quartiles, symmetry, skewness, and outliers |
| Scatter & Line | a graphical display of one variable plotted on the x axis and another on the y axis |
figs <- data.frame(
'Plot Types' = c("Bar", "Histogram", "Density", "Quantile-Quantile", "Box-Whisker", "Scatter & Line"),
# Description = c("a bar plot where each bar represents the frequency of observations for a given range of values",
# "an estimation of the frequency distribution based on the sample data",
# "a plot of the actual data values against a normal distribution",
# "a visual representation of median, quartiles, symmetry, skewness, and outliers",
# "a graphical display of one variable plotted on the x axis and another on the y axis",
# "plots formatted for the representation of circular data"
# ),
'Base R' = c("barplot()", "hist()", "plot(density())", "qqnorm()", "boxplot()", "plot()"),
'lattice' = c("barchart()", "histogram()", "densityplot()", "qq()", "bwplot()", "xyplot()"),
'ggplot geoms' = c("geom_bar()", "geom_histogram()", "geom_density()", "geom_qq()", "geom_boxplot()", "geom_point()"),
check.names = FALSE
)
knitr::kable(figs)
| Plot Types | Base R | lattice | ggplot geoms |
|---|---|---|---|
| Bar | barplot() | barchart() | geom_bar() |
| Histogram | hist() | histogram() | geom_histogram() |
| Density | plot(density()) | densityplot() | geom_density() |
| Quantile-Quantile | qqnorm() | qq() | geom_qq() |
| Box-Whisker | boxplot() | bwplot() | geom_boxplot() |
| Scatter & Line | plot() | xyplot() | geom_point() |
What is a normal distribution and why should you care? Many statistical methods are based on the properties of a normal distribution. Applying certain methods to data that are not normally distributed can give misleading or incorrect results. Most methods that assume normality are robust enough for all but the most abnormal data. This section is not meant to be a recipe for decision making, but rather an extension of the tools available to help you examine your data and proceed accordingly.

The impact of normality is most commonly seen in the parameters pedologists use to document the range of a variable (i.e., low, RV, and high values). Often a rule of thumb similar to “two standard deviations” is used to define the low and high values of a variable. This is fine if the data are normally distributed. However, if the data are skewed, the standard deviation does not provide useful information about the data distribution.

The quantitative measures of kurtosis (peakedness) and skewness (symmetry) can be used to assist in assessing normality; both can be found in the fBasics package. Webster (2001), however, cautions against using significance tests for assessing normality. The following sections and chapters will demonstrate various methods to cope with alternative distributions.
A Gaussian distribution is often referred to as a “Bell Curve” and has the following properties (Lane):
Viewing a histogram or density plot of your data provides a quick visual reference for assessing normality, as discussed in section 4.1. Distributions are typically normal, bimodal, or skewed:
Figure 2. Sample histograms
Occasionally distributions are uniform, or nearly so:
Figure 3. Uniform distribution
In this example the mean and median are only slightly different, so we can safely assume that we have a normal distribution. However, many soil variables have a non-normal distribution. For example, let’s graphically examine the mean vs. the median for clay content and rock fragments:
amean <- mean(clay)
amed <- median(clay)
library(ggplot2)
ggplot(h, aes(x = clay)) +
geom_density() +
geom_vline(xintercept = amean, col = "blue") +
geom_vline(xintercept = amed, col = "orange") +
ggtitle("Comparison of the Mean (Blue) and Median (Orange) for Clay Content")
amean <- mean(h$total_frags_pct)
amed <- median(h$total_frags_pct)
ggplot(h, aes(x = total_frags_pct)) +
geom_density() +
geom_vline(xintercept = amean, col = "blue") +
geom_vline(xintercept = amed, col = "orange") +
ggtitle("Comparison of the Mean (Blue) and Median (Orange) for Rock Fragments")
Figure 7. Comparison of the mean vs median for clay and rock fragments.
The blue vertical line represents the mean and the orange line represents the median. The median is a more robust measure of central tendency than the mean. In order for the mean to be a useful measure, the data distribution must be approximately normal; the further the data depart from normality, the less meaningful the mean becomes. The median always represents the same thing, independent of the data distribution: 50% of the samples fall below it and 50% fall above it. The example in Figure 7 for clay again indicates that the distribution is approximately normal. However, for rock fragments we see a long-tailed (i.e., skewed) distribution. Using the mean in this instance would overestimate the typical rock fragment content, although here the difference between the mean and the median is only 9 percent.
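The robustness of the median is easy to demonstrate with a toy example: adding a single extreme value pulls the mean sharply but barely moves the median.

```r
x <- c(12, 15, 18, 20, 22)
c(mean = mean(x), median = median(x))     # mean 17.4, median 18
x2 <- c(x, 95)                            # add one extreme value
c(mean = mean(x2), median = median(x2))   # mean jumps to 30.3, median only 19
```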
Next we’ll review the ggplot2 geom_histogram() and geom_density() functions which plot a histogram and density curve respectively.
library(ggplot2)
ggplot(h, aes(x = clay)) +
geom_histogram()
# or
ggplot(h, aes(x = clay)) +
geom_histogram() +
facet_wrap(~ genhz)
# or using the graphics package
# hist(h$clay, col = "grey")
Figure 4. Histogram
Since histograms depend on the number of bins, they are not the best method of determining the shape of a distribution for small datasets. A density estimate, also known as a kernel density plot, generally provides a better visualization of the shape of the distribution than the histogram.
ggplot(h, aes(x = clay, y = ..density..)) +
geom_histogram() +
geom_density() +
facet_wrap(~ genhz)
ggplot(h, aes(x = clay)) +
geom_density()
# or using the graphics package
# test <- density(h$clay, na.rm = TRUE)
# plot(test)
Figure 5. The Kernel density plot depicts a smoothed line of the distribution
Compared to the histogram, where the y-axis represents the number or percent (i.e., frequency) of observations, the y-axis of the density plot represents the probability of observing any given value, such that the area under the curve equals one. Notice how the histogram of clay emphasizes the long right tail of values, which might lead us to question whether the distribution is normal. One curious feature of the density curve is the hint of a bimodal distribution. Given that our sample includes a mixture of surface and subsurface horizons, we may have two different populations. However, considering how much the two distributions overlap, it seems impractical to separate them in this instance.
Box plots are a graphical representation of the five-number summary, depicting the quartiles, minimum, maximum, and outliers (if present). Box plots convey the shape of the data distribution and the presence of extreme values, and allow comparison with other variables plotted on the same scale, making them an excellent tool for screening data quality, determining thresholds for variables, and developing working hypotheses.
The parts of the boxplot are shown in Figure 8. The “box” of the boxplot is defined by the 1st quartile (Q1 in the figure) and the 3rd quartile (Q3 in the figure). The median, or 2nd quartile, is the dark line within the box. The whiskers extend to the most extreme data points that lie within 1.5 * IQR of the 3rd and 1st quartiles. Any data point beyond a whisker is considered an outlier.
That is not to say the points are in error, just that they are extreme compared to the rest of the dataset. However, you may want to evaluate these points to ensure that they are correct.
Figure 8. Boxplot description (Seltman, 2009)
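The 1.5 * IQR outlier rule can be computed by hand (a sketch with a toy vector); boxplot.stats() in base R applies the same rule:

```r
# flag outliers beyond 1.5 * IQR from the quartiles
x <- c(10, 12, 14, 15, 16, 18, 40)
q <- quantile(x, c(0.25, 0.75))   # 1st and 3rd quartiles
iqr <- diff(q)
lower <- q[1] - 1.5 * iqr
upper <- q[2] + 1.5 * iqr
x[x < lower | x > upper]          # 40 is flagged as an outlier
boxplot.stats(x)$out              # same point
```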
A boxplot of clay content by horizon may be made for the sample dataset as:
ggplot(h, (aes(x = genhz, y = clay))) +
geom_boxplot()
# or using the graphics package
# boxplot(clay ~ genhz, xlab = "Horizon", ylab="Clay (%)", data = h)
Figure 9. Box plot of clay by horizon
The xlab and ylab arguments in the commented-out boxplot() call control the titles of the x and y axes.
The above box plot shows a steady increase in clay content with depth. Notice the outliers in the box plots, identified by the individual circles.
A QQ plot (quantile-quantile plot) is a plot of the actual data values against the quantiles of a normal distribution (which has a mean of 0 and standard deviation of 1).
QQ plots of clay content and rock fragments may be made for the sample dataset as:
ggplot(h, aes(sample = clay)) +
geom_qq() +
ggtitle("Normal Q-Q Plot for Clay")
ggplot(h, aes(sample = total_frags_pct)) +
geom_qq() +
ggtitle("Normal Q-Q Plot for Frags")
Figure 6. QQplot
The line represents the quantiles of a normal distribution. If the dataset is perfectly normal, the data points will fall along the line. Overall, this plot shows that our clay example is more or less normally distributed. However, the second plot again shows that our rock fragment content is far from normally distributed.
A more detailed explanation of QQplots may be found on Wikipedia:
https://en.wikipedia.org/wiki/QQ_plot
A correlation matrix is a table of the calculated correlation coefficients of all variables. This provides a quantitative measure to guide the decision making process. The following will produce a correlation matrix for the loafercreek dataset:
h$hzdepm <- with(h, (hzdepb + hzdept) / 2) # Compute the middle horizon depth
vars <- c("hzdepm", "clay", "sand", "total_frags_pct", "phfield")
round(cor(h[vars], use = "complete.obs"), 2)
## hzdepm clay sand total_frags_pct phfield
## hzdepm 1.00 0.59 -0.08 0.50 -0.03
## clay 0.59 1.00 -0.17 0.26 0.13
## sand -0.08 -0.17 1.00 -0.02 0.11
## total_frags_pct 0.50 0.26 -0.02 1.00 -0.15
## phfield -0.03 0.13 0.11 -0.15 1.00
As seen in the output, variables are perfectly correlated with themselves, with a correlation coefficient of 1.0. Negative values indicate an inverse relationship between variables. What is considered highly correlated? A good rule of thumb is that any absolute value of 0.7 or greater indicates a strong correlation.
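The rule of thumb can be applied programmatically. The sketch below uses R’s built-in mtcars data (chosen only for illustration, not the soil data) to list the pairs whose absolute correlation is at least 0.7:

```r
# list variable pairs with |r| >= 0.7
m <- cor(mtcars[, c("mpg", "disp", "hp", "wt")])
idx <- which(abs(m) >= 0.7 & upper.tri(m), arr.ind = TRUE)
data.frame(var1 = rownames(m)[idx[, 1]],
           var2 = colnames(m)[idx[, 2]],
           r    = round(m[idx], 2))
```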
Plotting points of one ratio or interval variable against another is a scatter plot. Plots can be produced for a single or multiple pairs of variables. Many independent variables are often under consideration in soil survey work. This is especially common when GIS is used, which offers the potential to correlate soil attributes with a large variety of raster datasets.
The purpose of a scatter plot is to see how one variable relates to another. With modeling in general, the goal is parsimony (i.e., simplicity): determining the fewest number of variables required to explain or describe a relationship. If two variables explain the same thing, i.e., they are highly correlated, only one variable is needed. The scatter plot provides a perfect visual reference for this.
Create a basic scatter plot using the loafercreek dataset.
ggplot(h, aes(x = clay, y = hzdepm)) +
geom_point()
# or
# plot(hzdepm ~ clay, data = h)
Figure 10. Scatter Plot
This plots clay on the x axis and depth on the y axis. As shown in Figure 10, there is a moderate correlation between these variables.
The function below produces a scatterplot matrix for all the numeric variables in the dataset. This is a good command to use for determining rough linear correlations for continuous variables.
# Load the GGally package
library(GGally)
# Create a scatter plot matrix
vars <- c("hzdepm", "clay", "phfield", "total_frags_pct")
GGally::ggpairs(h[vars])
# or using base graphics
## Create a scatter plot matrix
# pairs(h[vars])
Slope aspect and pH are two common variables that warrant special consideration by pedologists.
Since pH is measured on a logarithmic scale, the median and quantile ranges are the preferred measures when summarizing pH. Remember, pH values of 6 and 5 correspond to hydrogen ion concentrations of 0.000001 and 0.00001, respectively. The actual average is 5.26; -log10((0.000001 + 0.00001) / 2). If no conversions are made for pH, the mean and sd in the summary are effectively the geometric mean and sd, not the arithmetic. The wider the pH range, the greater the difference between the geometric and arithmetic means. The difference between the correct average of 5.26 and the incorrect 5.5 is small, but proper handling of data types is a best practice.
If you have a table of pH values and wish to calculate the arithmetic mean using R, the following will work (note the conversion to hydrogen ion concentration and back):
# arithmetic mean
-log10(mean(10^-h$phfield, na.rm = TRUE))
# geometric mean
mean(h$phfield, na.rm = TRUE)
## [1] 6.163281
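The pH 5 and 6 example from the text can be reproduced directly:

```r
# averaging pH on the concentration scale vs. the pH scale
ph <- c(5, 6)
conc <- 10^-ph          # hydrogen ion concentrations
-log10(mean(conc))      # about 5.26, the arithmetic mean of concentrations
mean(ph)                # 5.5, the (geometric-style) mean
```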
If there is a need to create a surface of pH values (i.e., interpolate values from point observations), note that determining values at unknown points is analogous to averaging, so hydrogen ion concentration, rather than pH, is the proper input.
If spatial interpolation is going to be performed, the following steps should be executed:
Here is a brief example for interpolating pH using common software:
This is a workaround for ArcGIS, which truncates extremely small values, such as the H+ concentration for pH > 7.
Opening the table for the Event layer:
Slope aspect requires the use of circular statistics for numerical summaries, or graphical interpretation using circular plots. For example, if a soil map unit being summarized has a uniform distribution of slope aspects ranging from 335 degrees to 25 degrees (i.e., spanning north), the Zonal Statistics tool in ArcGIS would return a mean of 180, due south.
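The failure of the arithmetic mean for aspect is easy to reproduce by hand (a sketch using basic trigonometry rather than the circular package):

```r
# two aspects straddling north
a <- c(350, 10)
mean(a)   # 180, i.e., due south -- clearly wrong
# vector (circular) mean via trigonometry
rad <- a * pi / 180
deg <- atan2(mean(sin(rad)), mean(cos(rad))) * 180 / pi
round(deg) %% 360   # 0, i.e., due north
```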
The most intuitive means of evaluating and describing slope aspect are circular plots, available in the circular package in R and via the radial plot option in the TEUI Toolkit. The circular package will also calculate circular statistics such as the mean, median, and quartiles.
library(circular)
aspect <- s$aspect_field
aspect <- circular(aspect, template="geographic", units="degrees", modulo="2pi")
summary(aspect)
## n Min. 1st Qu. Median Mean 3rd Qu. Max. Rho
## 110.0000 5.0000 253.8000 190.0000 199.7000 104.8000 12.0000 0.1481
## NA's
## 5.0000
The numeric output is fine, but a graphic is more revealing. The rose diagram below shows the dominant southwest slope aspect.
rose.diag(aspect, bins = 8, col="grey")
Figure 11. Rose Diagram
One of the strengths of NASIS is that it has a wealth of existing queries and reports, which makes it easy for the average user to load their data and run numerous reports. In R, the soilDB package similarly has many fetch functions to load data. Recently the soilReports package was developed to replicate NASIS’s report repository. Currently soilReports contains a small collection of reports designed to summarize field and laboratory pedons and map unit polygons. This collection of reports automates many of the methods reviewed thus far, but it is no substitute for interacting with your data and asking questions. Examples can be found at the following link: https://github.com/ncss-tech/soilReports#example-output.
The soilReports R package is largely just a collection of R Markdown (.Rmd) reports. R Markdown is a document format that makes it easy to create reports and other dynamic documents. It allows R code and text to be mingled in the same document and executed like an R script. This allows R to generate reports similar to NASIS.
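A parameterized report declares its inputs in the YAML header at the top of the .Rmd file. The sketch below mirrors the parameter names used later in this example (series and genhz_rules); the actual header of the soilReports lab summary report may differ:

```yaml
---
title: "Lab pedon summary"
output: html_document
params:
  series: "Crosby"
  genhz_rules: "C:/workspace2/lab_sum/genhz_rules/Crosby_rules.R"
---
```

Values passed to render() via the params argument override these defaults, and are available inside the report as params$series, params$genhz_rules, and so on.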
Let's demonstrate how to run an existing .Rmd report to summarize the pedons of a given soil series.
Load your NASIS selected set. Run a query such as "POINT - Pedon/Site/NCSSlabdata by upedonid and Taxon Name" from the Region 11 report folder to load your selected set. Be sure to target the site, pedon, and lab layer tables. Remove from your selected set any pedons and sites you wish to exclude from the report.
Install or re-install the soilReports package. This package is updated regularly (e.g., weekly) and should be re-installed from GitHub periodically.
# Install the soilReports package from GitHub
devtools::install_github("ncss-tech/soilReports", dependencies=FALSE, upgrade_dependencies=FALSE)
# Load the soilReports and rmarkdown package
library(soilReports)
library(rmarkdown)
# List reports
listReports()
# Copy report to your directory
copyReport(reportName = "region11/lab_summary_by_taxonname", outputDir = "C:/workspace2/lab_sum")
If none exists, see the following job aid on how to create one: Assigning Generalized Horizon Labels.
Pay special attention to how the caret "^" and dollar "$" symbols are used. They are anchors that match the start and end of a string, respectively. For example, "^A" matches any horizon designation that begins with A, such as A, Ap, or Ap1, but not BA. Similarly, "Bt$" matches any designation that ends in Bt, such as 2Bt or 3Bt, but not Bt1. Wrapping a designation with both symbols matches only that exact string: "^A$" matches A, but not Ap or A1.
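The anchor behavior can be checked with grepl() on a small set of hypothetical horizon designations:

```r
# Hypothetical horizon designations to illustrate regex anchors
hz <- c("A", "Ap", "Ap1", "BA", "Bt1", "2Bt", "3Bt")

hz[grepl("^A", hz)]   # begins with A: "A" "Ap" "Ap1" (BA is excluded)
hz[grepl("Bt$", hz)]  # ends in Bt: "2Bt" "3Bt" (Bt1 is excluded)
hz[grepl("^A$", hz)]  # exact match only: "A"
```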
Command-line approach
# Set report parameters
series <- "Crosby"
genhz_rules <- "C:/workspace2/lab_sum/genhz_rules/Crosby_rules.R"
# report file path
report_path <- "C:/workspace2/lab_sum/report.Rmd"
# Run the report
render(input = report_path,
output_dir = "C:/workspace2",
output_file = "C:/workspace2/lab_sum.html",
envir = new.env(),
params = list(series = series,
genhz_rules = genhz_rules
)
)
Manual approach
Open the report.Rmd, hit the Knit drop down arrow, and select Knit with Parameters.
Sample pedon report
To summarize spatial data for a polygon, some form of zonal statistics can be used. Zonal statistics is a generic term for statistics that aggregate data over an area or zone (e.g., a set of map unit polygons). Zonal statistics can be computed by two methods.

The most common method, provided by most GIS software, is the census approach, which computes a statistical summary of all the raster cells that overlap a polygon or map unit. This approach is generally faster and provides a complete summary of the spatial data. However, the census approach typically provides only basic statistics, such as the minimum, maximum, mean, standard deviation, and sum, and for skewed data sets the mean and standard deviation may be unreliable. A better alternative for skewed data sets is nonparametric statistics such as quantiles; examples of nonparametric statistics are covered in Chapter 4.

The alternative is the sample approach, which takes a collection of random samples from each polygon or map unit. While the sample approach is generally slower and does not visit every cell that overlaps a polygon, it allows us to utilize alternative statistics, such as quantiles, and perform a more thorough exploratory analysis. Some might prefer the census approach because it provides a complete summary of all the data that overlap a map unit, but it is important to remember that spatial data are only approximations of the physical world, and are therefore themselves estimates with varying levels of precision.
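The census-style aggregation and the case for quantiles can both be sketched in base R without any GIS. The zones and values below are simulated; tapply() stands in for a zonal statistics tool by grouping every cell value by its map unit:

```r
# Simulated cell values for two hypothetical map units
set.seed(42)
zone  <- rep(c("MU1", "MU2"), each = 500)
value <- c(rnorm(500, mean = 10, sd = 2),           # roughly symmetric zone
           rlnorm(500, meanlog = 2, sdlog = 0.8))   # right-skewed zone

# Basic census-style statistics (what most GIS return)
tapply(value, zone, mean)
tapply(value, zone, sd)

# Quantiles - more reliable for the skewed zone
tapply(value, zone, quantile, probs = c(0.05, 0.5, 0.95))
```

For the skewed zone, the mean sits well above the median, illustrating why the mean and standard deviation alone can misrepresent skewed terrain attributes.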
Before extracting spatial data for the purpose of spatial prediction, it is necessary that the data meet the following conditions:
Zonal statistics in R can be implemented using either the census or the sample approach. While R can compute census-style zonal statistics with the zonal() function in the raster package, it is considerably faster to call another GIS via the RSAGA or spgrass6 packages, which provide R functions for running SAGA and GRASS commands. This example uses the RSAGA package.
# library(rgdal)
#
# ca794 <- readOGR(dsn = "E:/geodata/project_data/8VIC/ca794", layer = "ca794") # import CA794 map unit polygons
# ca794 <- spTransform(ca794, CRS("+init=epsg:5070")) # reproject
# ca794$mukey2 <- as.integer(as.character(ca794$MUKEY)) # convert MUKEY to an integer
# writeOGR(ca794, dsn = "C:/workspace", layer = "ca794", driver = "ESRI Shapefile", overwrite_layer = TRUE) # export shapefile
#
# library(RSAGA)
# myenv <- rsaga.env(path = "C:/Program Files/QGIS Essen/apps/saga") # set rsaga path
# ned <- raster("E:/geodata/project_data/8VIC/sdat/ned30m_8VIC.sdat") # load DEM
# test <- raster(extent(ca794), ext = extent(ned), crs = crs(ned), res = res(ned)) # create a blank raster that matches the DEM
# writeRaster(test, file = "E:/geodata/project_data/8VIC/sdat/ca794.sdat", format = "SAGA", progress = "text", overwrite = TRUE) # export the raster
#
# # Convert the CA794 shapefile to a rsaga raster
#
# rsaga.geoprocessor("grid_gridding", 0, env = myenv, list(
# INPUT = "C:/workspace/ca794.shp",
# FIELD = "mukey2",
# OUTPUT = "2",
# TARGET = "0",
# GRID_TYPE = "2",
# USER_GRID = "E:/geodata/project_data/8VIC/sdat/ca794.sgrd",
# USER_XMIN = extent(test)[1] + 15,
# USER_XMAX = extent(test)[2] - 15,
# USER_YMIN = extent(test)[3] + 15,
# USER_YMAX = extent(test)[4] - 15,
# USER_SIZE = res(test)[1]
# )
# )
#
# # Extract the zonal statistics for 2 rasters
#
# rsaga.geoprocessor("statistics_grid", 5, env = myenv, list(
# ZONES = "E:/geodata/project_data/8VIC/sdat/ca794.sgrd",
# STATLIST = paste(c("E:/geodata/project_data/8VIC/sdat/ned30m_8VIC.sgrd", "E:/geodata/project_data/8VIC/sdat/ned30m_8VIC_slope5.sgrd"), collapse = ";"),
# OUTTAB = "C:/workspace/github/stats_for_soil_survey/trunk/data/ca794_zonal.csv"
# ))
#
test <- read.csv("C:/workspace2/github/stats_for_soil_survey/trunk/data/ca794_zonal.csv") # import the results
names(test)[1] <- "mukey" # rename the mukey column
subset(test, mukey == 2480977) # examine mukey 2480977
## mukey Count.UCU ned30m_8VICN ned30m_8VICMIN ned30m_8VICMAX
## 76 2480977 204460 204460 503.1204 1687.588
## ned30m_8VICMEAN ned30m_8VICSTDDEV ned30m_8VICSUM ned30m_8VIC_slope5N
## 76 942.5077 206.0661 192705133 204460
## ned30m_8VIC_slope5MIN ned30m_8VIC_slope5MAX ned30m_8VIC_slope5MEAN
## 76 0.195744 94.46768 38.52244
## ned30m_8VIC_slope5STDDEV ned30m_8VIC_slope5SUM
## 76 16.73909 7876299
To implement the sample approach to zonal statistics, we can use the extract() and over() functions from the raster and sp packages, respectively.
# # Take a stratified random sample
# s <- spsample(ca794, n = 1000, type = "stratified")
#
# setwd("E:/geodata/project_data/8VIC/ca794")
#
# # Create a raster stack
# rs <- stack(c(
# elev = "ned30m_8VIC.tif",
# slope = "ned30m_8VIC_slope5.tif")
# )
# # Set the spatial projection
# proj4string(rs) <- CRS("+init=epsg:5070")
#
# # Extract the map unit polygon value
# test1 <- over(s, ca794)
#
# # Extract the raster values
# test2 <- data.frame(extract(rs, s))
#
# # Combine the two datasets
# test2 <- cbind(test1, test2)
#
# # Cache/save the results
# save(test2, file = "C:/workspace/github/stats_for_soil_survey/trunk/data/ch2_sample.Rdata")
# Load the cache/saved results
load(file = "C:/workspace2/github/stats_for_soil_survey/trunk/data/ch2_sample.Rdata")
# Examine summary for mukey 2480977
summary(subset(test2, MUKEY == 2480977))
## AREASYMBOL SPATIALVER MUSYM MUKEY mukey2
## CA794:53 Min. :2 1255 :53 2480977:53 Min. :2480977
## 1st Qu.:2 1220 : 0 1910055: 0 1st Qu.:2480977
## Median :2 1225 : 0 1910056: 0 Median :2480977
## Mean :2 1230 : 0 1910058: 0 Mean :2480977
## 3rd Qu.:2 1240 : 0 1910059: 0 3rd Qu.:2480977
## Max. :2 1241 : 0 1910060: 0 Max. :2480977
## (Other): 0 (Other): 0
## elev slope
## Min. : 566.8 Min. : 9.731
## 1st Qu.: 801.2 1st Qu.:26.031
## Median : 930.8 Median :42.491
## Mean : 959.7 Mean :40.773
## 3rd Qu.:1123.6 3rd Qu.:53.470
## Max. :1467.1 Max. :71.997
##
Compare the results of the census and sample approaches above. While the census approach surveyed 204,460 cells, the sample approach surveyed only 5,777. Even so, the results of the two approaches are largely similar.
The TEUI toolkit works in a similar manner to Zonal Statistics as Table within ArcGIS, with the added benefit of interactive graphics to aid in the assessment. TEUI is an ArcGIS Add-in and may be installed without Administrator privilege. Additional information and downloads are available here:
http://www.fs.fed.us/eng/rsac/programs/teui/downloads.html
One advantage of TEUI is the ability to output tabular data at both the map unit and polygon level with one operation. SECTIONS 2 - 5 of the TEUI User Guide are excerpted below.
SECTION 2. Creating and opening an existing project
The Toolkit requires the user to specify a folder location to store the database that contains the statistics and the file location of the data used to create them. This database can be an existing project folder location, but it is recommended that a new folder be created to reduce the number of extraneous files in a folder with other data.
Create a New Toolkit Project
On the TEUI Toolbar, select the Folder icon.
A dialog window will appear.
Select Browse…
In the Browse for Folder dialog, navigate to a location of your choice and select "Make New Folder"
Type in an appropriate name for your project
Click OK
Open an Existing Toolkit Project
On the TEUI Toolbar, select the Folder icon
A dialog window will appear.
If the project was recent, click on the project in the Recent dialog box and click Open. If it is older and does not appear in the dialog box, select Browse and navigate to the project folder location. Select the project and click OK.
SECTION 3. Manage geospatial data and calculate statistics
Managing project data with version 5.0 is very simple: you simply point the program to the file location of the data to be used in the analysis.
Note: If you move the geospatial data used in an existing Toolkit project, you will need to re-link the Toolkit to the data when opening that project.
Add Analysis Features to a Toolkit Project
R GUI image
Analysis Features: the zones within which you would like to generate statistics. These can be feature classes, such as polygons or points (line features will be available soon), stored as a shapefile (.shp), as a feature class within a file geodatabase, or within a feature dataset within a file geodatabase. Analysis features can also be a discrete raster, in which case the Value field is used as the identifier.
Rasters: the raster data used to generate the descriptive statistics, for example a digital elevation model (DEM), percent slope, aspect, or a discrete raster such as land use.
R GUI image
R GUI image
Note: All data layers to be analyzed (analysis features and raster data) must have the same geographic coordinate system and projection as each other. You will receive a warning if they are different.
Remove Analysis Features from a Toolkit Project
Add Analysis Features to the ArcMap Table of Contents
R GUI image
Choose a Map Unit
In some instances, such as TEUI mapping or Soil Survey, you may wish to identify individual polygon features as belonging to a specific map unit. This is simply a repeating identifying symbol or value which will be used to aggregate the statistics of the polygons belonging to that group. You do not need to select a map unit column in order to run the Toolkit. The symbol must be present as an attribute column in either the feature class layer or as an attribute in the VAT table of a discrete raster.
Select the Analysis Feature layer that you wish to add a Map Unit symbol for.
Select the appropriate map unit column attribute name for your map unit symbol in the drop down menu
R GUI image
Add Raster Data to the Toolkit Project
To add raster data to your Toolkit project, select the Add Layer button under the Raster heading
In the Browse to Raster file location dialog window, navigate to the file location of the raster data you wish to add to the Toolkit project.
R GUI image
Add Raster Data to the ArcMap Table of Contents
R GUI image
Click the Add to Table of Contents button.
To add multiple raster data layers at once, hold down the control button while selecting the individual layers you wish to add.
Calculate Statistics
R GUI image
SECTION 4. Create and Visualize Graphs and Tables
A primary Toolkit feature is the ability to view the generated statistics in both graph and table format. The Toolkit creates zonal statistics for the analysis features and raster data selected in the Data Manager. The results can be visualized as a graph or summary table, either by individual feature or by map unit if one was chosen.
Open a graph window
R GUI image
MU Comparison: This data view type contains unique default graphs when comparing map units or other aggregated statistics types. These graphs have the mean (solid line), range (darker colored area), and standard deviation (lighter colored area) on the same graph, for each polygon within a map unit.
R GUI image
Chart type: You can choose the type of graph that best represents your data. The default is the area chart type.
Area chart
Bar chart
R GUI image
Point chart
R GUI image
Line chart
R GUI image
Radial chart
R GUI image
Feature Source: Choose which feature analysis layer (polygons or a discrete raster) you would like to graph.
R GUI image
Map Unit: Choose the map unit from which you would like to select a polygon. Alternatively, select just the map unit you want to graph as a whole (i.e., don't select an individual feature below). You can also leave this option blank, which is the default.
R GUI image
Feature: Select the individual feature you would like to view statistics for. This is the Feature ID or FID.
R GUI image
Raster Source: Select the raster layer you wish to view the statistics from.
R GUI image
Raster Band: Select the raster band from the raster layer.
R GUI image
Set Graph: Click once finished.
R GUI image
Common tasks with graphs
How do I add more individual features or map units to the same graph? Within the Add Series pane, simply select a new feature source (if desired), map unit, feature, or raster source. Click Set Graph and the new feature or map unit will be added to the graph and the legend in a new color.
R GUI image
Can I add multiple axes to the graph? For example, a graph of elevation and percent slope?
To add another axis to your graph window, simply select the feature source, a map unit if desired, or a feature if desired, but select a new raster data source. You will notice in the graph legend map units and features are separated by the raster layer.
R GUI image
Is there a way to find a specific value on the graph?
If you hover over a specific spot in the graph, a line and window will appear with the specific values that your cursor is on. This is valuable if you are trying to identify outliers in your data.
R GUI image
Can I normalize the data ranges?
To normalize data so that the data ranges are comparable within the graph window, check the normalize button at the top of the graph.
R GUI image
How do I remove a graph series from the graph window?
Simply hit the red X next to the graph series you would like to remove in the graph legend.
R GUI image
How do I view the tabular data associated with each map unit or feature in the graph window?
To view the tabular and descriptive statistics of each map unit or feature in the graph window, simply select the table icon next to the data series in the graph legend.
R GUI image
SECTION 5. Exporting Statistics Data to Excel
The statistical data summaries produced by the Toolkit can be exported to Excel as a .CSV file for further use.
Export individual feature statistics
R GUI image
On the main menu bar, click the export table button with the green arrow. Note that this may take a little while as the program gathers the data.
Navigate to the location you would like to save the .CSV file. Give the file a name and click Save.
R GUI image
Export map unit feature statistics
R GUI image
On the main Toolkit menu bar, click the export map unit table button with the blue arrow. Note that this may take a little while as the program gathers the data.
Navigate to the location you would like to save the .CSV file. Give the file a name and click Save.
Gathering statistics of raster data cells for polygon data sets like SSURGO is typically achieved by the use of the Zonal Statistics as Table function from the Spatial Analyst toolbox. The output will be a tabular summary for the specified Zone, usually map unit symbol or individual polygons.
Example for summarizing by Map Unit Symbol:
Open the Zonal Statistics as Table Tool in the Zonal Toolbox.
R GUI image
The input zone field is MUSYM and the input raster is slope:
R GUI image
Which produces a table with the following output:
R GUI image
The mean and standard deviation are acceptable, provided the distributions are close to normal and the interest is in summarizing the entire population of map units. To get a closer look at individual polygons, summarize using the FID:
R GUI image
Which produces a table with the following output:
R GUI image
Use the Join capability to associate the table of statistics to the spatial data:
R GUI image
Which lets you view results by polygon and search for outliers or polygons that require further investigation:
R GUI image
In this example, 4 polygons of ChC2, Coshocton silt loam, 6 to 15 percent slopes, eroded, have an average slope greater than 15 percent. Discrepancies like this will need to be investigated and resolved.
R GUI image
In another example, a box plot assessment of a map unit with a slope class of 15 to 25 percent indicates that half of the polygons have an average slope of less than 15 percent:
R GUI image
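The caveat about relying on the mean can be sketched with a quick numeric check: when a few atypical polygons inflate the tail, the mean drifts toward the class limit while the median stays near the typical polygon. The slope values below are hypothetical.

```r
# Hypothetical average slopes for ten polygons of a 6 to 15 percent map unit;
# two atypical polygons create a right tail
slope <- c(6, 7, 8, 9, 10, 11, 12, 14, 28, 35)

mean(slope)    # 14 - pulled toward the upper class limit by two polygons
median(slope)  # 10.5 - closer to the typical polygon
```

Flagging polygons whose mean and median disagree strongly is a simple way to find candidates for further investigation.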
FAO Corporate Document Repository. http://www.fao.org/docrep/field/003/AC175E/AC175E07.htm
Lane, D.M. Online Statistics Education: A Multimedia Course of Study. Rice University. http://onlinestatbook.com/
Seltman, H., 2009. Experimental Design and Analysis. Chapter 4: Exploratory Data Analysis. Carnegie Mellon University. http://www.stat.cmu.edu/~hseltman/309/Book/
Vaughan, R., and K. Megown, 2015. The Terrestrial Ecological Unit Inventory (TEUI) Geospatial Toolkit: User Guide v5.2. RSAC-10117-MAN1. Salt Lake City, UT: U.S. Department of Agriculture, Forest Service, Remote Sensing Applications Center. 40 p. http://www.fs.fed.us/eng/rsac/programs/teui/about.html
Tukey, J., 1977. Exploratory Data Analysis. Addison-Wesley.
Tukey, J., 1980. We need both exploratory and confirmatory. The American Statistician 34(1):23-25.
Webster, R., 2001. Statistics to support soil research and their presentation. European Journal of Soil Science 52:331-340. http://onlinelibrary.wiley.com/doi/10.1046/j.1365-2389.2001.00383.x/abstract
Peng, R.D., 2016. Exploratory Data Analysis with R. Leanpub. https://bookdown.org/rdpeng/exdata/
Wickham, H., and G. Grolemund, 2017. Chapter 7: Exploratory Data Analysis, pp. 81-109. In: R for Data Science. O'Reilly Media Inc., Sebastopol, CA. http://r4ds.had.co.nz/exploratory-data-analysis.html
Healy, K., 2018. Data Visualization: A Practical Introduction. Princeton University Press. http://socviz.co/
Helsel, D.R., and R.M. Hirsch, 2002. Statistical Methods in Water Resources. Techniques of Water Resources Investigations, Book 4, Chapter A3. U.S. Geological Survey. 522 p. http://pubs.usgs.gov/twri/twri4a3/